Words can then be clustered via a collocation analysis. A short explanation:
The term collocation has traditionally been restricted to words that are juxtaposed as phrases, like “strong coffee”, “strict regime” or “eat dinner”. Here we take collocations to be realized as skipgrams, or as word pairs that simply co-occur within a context, where the context is a contiguous sequence of words, typically a paragraph or a window of n words around a given word. Juxtaposed collocates will also be part of the result set.
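The window-based notion of context described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `window_cooccurrences` and the toy sentence are made up here, and the `before`/`after` parameters merely echo the names used in the `Cluster` calls in this notebook without claiming to reproduce how nbtext counts.

```python
from collections import Counter

def window_cooccurrences(tokens, target, before=5, after=5):
    """Count the words found within `before` tokens to the left and
    `after` tokens to the right of every occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - before)
        end = min(len(tokens), i + after + 1)
        # Count every token in the window except the target position itself
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Toy sentence, invented for illustration
tokens = "we drink strong coffee and they drink very strong black coffee".split()
counts = window_cooccurrences(tokens, "coffee", before=2, after=2)
print(counts.most_common())
```

In the toy sentence, "strong" is counted both when it is juxtaposed with "coffee" and when another word intervenes ("strong black coffee"), which is the sense in which juxtaposed collocates remain part of the result set.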
The collocates for W are the words associated with W according to a measure of association. The purpose of such a measure for a word W (e.g. “democracy”) is to provide a means of collecting associated words in the discourses in which W occurs within a corpus C. Collocations can be viewed as collecting discourse markers for W, in the sense that the collocates are uttered (written or spoken) together with W. The collocates for W are computed by applying an association measure to the set of all co-occurring words.
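As a rough illustration of what an association measure can look like, the sketch below scores each co-occurring word by its relative frequency near W, raised to an exponent, divided by its relative frequency in a reference corpus. This particular formula, the function names, and the toy counts are assumptions made for the example; it is not claimed to be the measure nbtext computes.

```python
from collections import Counter

def association_scores(colloc_counts, reference_counts, exponent=1.0):
    """Score each collocate as (relative frequency near W) ** exponent
    divided by its relative frequency in a reference corpus."""
    colloc_total = sum(colloc_counts.values())
    ref_total = sum(reference_counts.values())
    scores = {}
    for word, count in colloc_counts.items():
        ref_count = reference_counts.get(word, 0)
        if ref_count == 0:
            continue  # no reference frequency: skip rather than divide by zero
        rel_colloc = count / colloc_total
        rel_ref = ref_count / ref_total
        scores[word] = rel_colloc ** exponent / rel_ref
    return scores

def top_collocates(scores, top=3):
    """Return the `top` highest-scoring words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Toy counts, invented for illustration only
colloc = Counter({"the": 50, "election": 30, "vote": 20})
reference = Counter({"the": 5000, "election": 50, "vote": 100, "coffee": 4850})
scores = association_scores(colloc, reference)
ranked = top_collocates(scores)
print(ranked)
```

A frequent function word such as "the" co-occurs often with W but is just as common everywhere else, so it ends up at the bottom of the ranking, while words that are comparatively rare in the reference counts rise to the top.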
Start by importing the commands needed from nbtext
import nbtext as nb
from nbtext import cloud, get_urn, Cluster, Corpus
%matplotlib inline
word='abort'
korpus = 'avis'
exp=1.06
brev_cluster_1800 = Cluster(
'Abortus',
period=(1810, 1900),
before=5,
after=5,
corpus=korpus,
reference=150,
word_samples=500)
brev_cluster_00_50 = Cluster(
word,
period=(1900, 1950),
before=5,
after=5,
corpus=korpus,
reference=150,
word_samples=500)
brev_cluster_50_70 = Cluster(
word,
period=(1950, 1970),
before=5,
after=5,
corpus=korpus,
reference=150,
word_samples=500)
brev_cluster_70_80 = Cluster(
word,
period=(1970, 1980),
before=5,
after=5,
corpus=korpus,
reference=150,
word_samples=500)
brev_cluster_80_90 = Cluster(
word,
period=(1980, 1990),
before=5,
after=5,
corpus=korpus,
reference=150,
word_samples=500)
brev_cluster_90_00 = Cluster(
word,
period=(1990, 2000),
before=5,
after=5,
corpus=korpus,
reference=150,
word_samples=500)
Which words are associated with the cluster word? Does the context vary? Is more data needed? Is the result interpretable?
brev_cluster_1800.cluster_set(top=250, exponent=exp)
brev_cluster_00_50.cluster_set(top=250, exponent=exp)
brev_cluster_50_70.cluster_set(top=250, exponent=exp)
brev_cluster_70_80.cluster_set(top=250, exponent=exp)
brev_cluster_80_90.cluster_set(top=250, exponent=exp)
brev_cluster_90_00.cluster_set(top=250, exponent=exp)
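One possible way to put a number on how much the context varies between periods is to compare the word lists from two clusters with a Jaccard overlap. The sketch below is plain Python and assumes that `cluster_set` returns a list of words by default, which the `aslist` parameter suggests but which is not verified here. The two word lists are placeholders, not actual output.

```python
def jaccard(words_a, words_b):
    """Share of words common to both lists, relative to all distinct words.
    Returns 0.0 when both lists are empty (a convention chosen here)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Placeholder word lists standing in for two cluster_set() results
period_1 = ["alpha", "beta", "gamma", "delta"]
period_2 = ["gamma", "delta", "epsilon"]
overlap = jaccard(period_1, period_2)
print(round(overlap, 2))
```

A value near 1 would indicate that the two periods share most of their collocates, while a value near 0 would indicate that the word appears in largely different contexts.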
The cluster can be studied as a word cloud
The command for drawing a word cloud is cloud(). The argument can be many things, as long as it maps a word to a number. Data frames consisting of a single column work fine.
nb.cloud(brev_cluster_1800.cluster_set(aslist=False, exponent=exp)[:150], background='black')
nb.cloud(brev_cluster_00_50.cluster_set(aslist=False, exponent=exp)[:150], background='black')
nb.cloud(brev_cluster_50_70.cluster_set(aslist=False, exponent=exp)[:150], background='black')
nb.cloud(brev_cluster_70_80.cluster_set(aslist=False, exponent=exp)[:150], background='black')
nb.cloud(brev_cluster_80_90.cluster_set(aslist=False, exponent=exp)[:150], background='black')
nb.cloud(brev_cluster_90_00.cluster_set(aslist=False, exponent=exp)[:150], background='black')